MERS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes

Internal identifier: 002A10 (Main/Exploration); previous: 002A09; next: 002A11

Authors: Stefan Kurtz [Germany]; Apurva Narechania [United States]; Joshua C. Stein [United States]; Doreen Ware [United States]

Source: BMC Genomics (2008)

RBID : PMC:2613927

French descriptors

English descriptors

Abstract

Background

The challenges of accurate gene prediction and enumeration are further aggravated in large genomes that contain highly repetitive transposable elements (TEs). Yet TEs play a substantial role in genome evolution and are themselves an important subject of study. Repeat annotation, based on counting occurrences of k-mers, has been previously used to distinguish TEs from low-copy genic regions; but currently available software solutions are impractical due to high memory requirements or specialization for specific user-tasks.

Results

Here we introduce the Tallymer software, a flexible and memory-efficient collection of programs for k-mer counting and indexing of large sequence sets. Unlike previous methods, Tallymer is based on enhanced suffix arrays. This allows much greater flexibility in the choice of k-mer size. Tallymer can process large data sets of several billion bases. We used it in a variety of applications to study the genomes of maize and other plant species. In particular, Tallymer was used to index a set of whole genome shotgun sequences from maize (B73) (total size 10⁹ bp). We analyzed k-mer frequencies for a wide range of k. At this low genome coverage (≈ 0.45×), highly repetitive 20-mers constituted 44% of the genome but represented only 1% of all possible k-mers. Similarly low complexity was seen in the repeat fractions of sorghum and rice. When applying our method to other maize data sets, High-C₀t derived sequences showed the greatest enrichment for low-copy sequences. Among annotated TEs, the most highly repetitive were of the Ty3/gypsy class of retrotransposons, followed by the Ty1/copia class, and DNA transposons. Among expressed sequence tags (ESTs), a notable fraction contained high-copy k-mers, suggesting that transposons are still active in maize. Retrotransposons in Mo17 and McC cultivars were readily detected using the B73 20-mer frequency index, indicating their conservation despite extensive rearrangement across cultivars. Among one hundred annotated bacterial artificial chromosomes (BACs), k-mer frequency could be used to detect transposon-encoded genes with 92% sensitivity, compared to 96% using alignment-based repeat masking, while both methods showed 92% specificity.
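
To make the idea of suffix-array-based k-mer counting concrete, here is a minimal Python sketch. It is not Tallymer's implementation: the function names `suffix_array` and `kmer_counts` are hypothetical, the suffix array is built naively (far too slowly for genome-scale data), and none of the "enhanced" suffix array structures are modelled. It only illustrates why one sorted index can serve any choice of k: suffixes that share the same first k characters are adjacent in sorted order, so counting a k-mer amounts to measuring the length of a run.

```python
from itertools import groupby


def suffix_array(text):
    """Naive suffix array: start positions ordered by the suffix they begin.

    Illustration only; real tools use far more efficient constructions.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def kmer_counts(text, sa, k):
    """Count every k-mer of `text` using the precomputed suffix array `sa`.

    Suffixes sharing a k-character prefix are contiguous in `sa`,
    so each run of equal prefixes corresponds to one distinct k-mer.
    """
    # Skip suffixes shorter than k: they do not contain a full k-mer.
    prefixes = (text[i:i + k] for i in sa if i + k <= len(text))
    return {kmer: sum(1 for _ in run) for kmer, run in groupby(prefixes)}


if __name__ == "__main__":
    sequence = "ACGTACGTAC"      # toy sequence, assumed for illustration
    sa = suffix_array(sequence)  # built once ...
    for k in (2, 4):             # ... and reused for several values of k
        print(k, kmer_counts(sequence, sa, k))
```

The index is built once and then queried for k = 2 and k = 4 without being rebuilt, which is the flexibility in k that the abstract attributes to the suffix-array approach.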

Conclusion

The Tallymer software was effective in a variety of applications to aid genome annotation in maize, despite limitations imposed by the relatively low coverage of sequence available. For more information on the software, see http://www.zbh.uni-hamburg.de/Tallymer.


Url: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2613927
DOI: 10.1186/1471-2164-9-517
PubMed: 18976482
PubMed Central: 2613927


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes</title>
<author>
<name sortKey="Kurtz, Stefan" sort="Kurtz, Stefan" uniqKey="Kurtz S" first="Stefan" last="Kurtz">Stefan Kurtz</name>
<affiliation wicri:level="1">
<nlm:aff id="I1">Center for Bioinformatics, University of Hamburg, Bundesstraße 43, 20146 Hamburg, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Center for Bioinformatics, University of Hamburg, Bundesstraße 43, 20146 Hamburg</wicri:regionArea>
<wicri:noRegion>20146 Hamburg</wicri:noRegion>
<placeName>
<settlement type="city">Hambourg</settlement>
<region type="land" nuts="2">Hambourg</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Narechania, Apurva" sort="Narechania, Apurva" uniqKey="Narechania A" first="Apurva" last="Narechania">Apurva Narechania</name>
<affiliation wicri:level="2">
<nlm:aff id="I2">Cold Spring Harbor Lab, 1 Bungtown Rd, Williams #5, Cold Spring Harbor, NY 11724, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Cold Spring Harbor Lab, 1 Bungtown Rd, Williams #5, Cold Spring Harbor, NY 11724</wicri:regionArea>
<placeName>
<region type="state">État de New York</region>
</placeName>
</affiliation>
<affiliation wicri:level="2">
<nlm:aff id="I3">Sackler Institute for Comparative Genomics, American Museum of Natural History, New York, NY, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Sackler Institute for Comparative Genomics, American Museum of Natural History, New York, NY</wicri:regionArea>
<placeName>
<region type="state">État de New York</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Stein, Joshua C" sort="Stein, Joshua C" uniqKey="Stein J" first="Joshua C" last="Stein">Joshua C. Stein</name>
<affiliation wicri:level="2">
<nlm:aff id="I2">Cold Spring Harbor Lab, 1 Bungtown Rd, Williams #5, Cold Spring Harbor, NY 11724, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Cold Spring Harbor Lab, 1 Bungtown Rd, Williams #5, Cold Spring Harbor, NY 11724</wicri:regionArea>
<placeName>
<region type="state">État de New York</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Ware, Doreen" sort="Ware, Doreen" uniqKey="Ware D" first="Doreen" last="Ware">Doreen Ware</name>
<affiliation wicri:level="2">
<nlm:aff id="I2">Cold Spring Harbor Lab, 1 Bungtown Rd, Williams #5, Cold Spring Harbor, NY 11724, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Cold Spring Harbor Lab, 1 Bungtown Rd, Williams #5, Cold Spring Harbor, NY 11724</wicri:regionArea>
<placeName>
<region type="state">État de New York</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">18976482</idno>
<idno type="pmc">2613927</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2613927</idno>
<idno type="RBID">PMC:2613927</idno>
<idno type="doi">10.1186/1471-2164-9-517</idno>
<date when="2008">2008</date>
<idno type="wicri:Area/Pmc/Corpus">000551</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000551</idno>
<idno type="wicri:Area/Pmc/Curation">000551</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Curation">000551</idno>
<idno type="wicri:Area/Pmc/Checkpoint">001405</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Checkpoint">001405</idno>
<idno type="wicri:source">PubMed</idno>
<idno type="RBID">pubmed:18976482</idno>
<idno type="wicri:Area/PubMed/Corpus">002060</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">002060</idno>
<idno type="wicri:Area/PubMed/Curation">002060</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">002060</idno>
<idno type="wicri:Area/PubMed/Checkpoint">002018</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Checkpoint">002018</idno>
<idno type="wicri:Area/Ncbi/Merge">000650</idno>
<idno type="wicri:Area/Ncbi/Curation">000650</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">000650</idno>
<idno type="wicri:Area/Main/Merge">002A36</idno>
<idno type="wicri:Area/Main/Curation">002A10</idno>
<idno type="wicri:Area/Main/Exploration">002A10</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes</title>
<author>
<name sortKey="Kurtz, Stefan" sort="Kurtz, Stefan" uniqKey="Kurtz S" first="Stefan" last="Kurtz">Stefan Kurtz</name>
<affiliation wicri:level="1">
<nlm:aff id="I1">Center for Bioinformatics, University of Hamburg, Bundesstraße 43, 20146 Hamburg, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Center for Bioinformatics, University of Hamburg, Bundesstraße 43, 20146 Hamburg</wicri:regionArea>
<wicri:noRegion>20146 Hamburg</wicri:noRegion>
<placeName>
<settlement type="city">Hambourg</settlement>
<region type="land" nuts="2">Hambourg</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Narechania, Apurva" sort="Narechania, Apurva" uniqKey="Narechania A" first="Apurva" last="Narechania">Apurva Narechania</name>
<affiliation wicri:level="2">
<nlm:aff id="I2">Cold Spring Harbor Lab, 1 Bungtown Rd, Williams #5, Cold Spring Harbor, NY 11724, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Cold Spring Harbor Lab, 1 Bungtown Rd, Williams #5, Cold Spring Harbor, NY 11724</wicri:regionArea>
<placeName>
<region type="state">État de New York</region>
</placeName>
</affiliation>
<affiliation wicri:level="2">
<nlm:aff id="I3">Sackler Institute for Comparative Genomics, American Museum of Natural History, New York, NY, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Sackler Institute for Comparative Genomics, American Museum of Natural History, New York, NY</wicri:regionArea>
<placeName>
<region type="state">État de New York</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Stein, Joshua C" sort="Stein, Joshua C" uniqKey="Stein J" first="Joshua C" last="Stein">Joshua C. Stein</name>
<affiliation wicri:level="2">
<nlm:aff id="I2">Cold Spring Harbor Lab, 1 Bungtown Rd, Williams #5, Cold Spring Harbor, NY 11724, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Cold Spring Harbor Lab, 1 Bungtown Rd, Williams #5, Cold Spring Harbor, NY 11724</wicri:regionArea>
<placeName>
<region type="state">État de New York</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Ware, Doreen" sort="Ware, Doreen" uniqKey="Ware D" first="Doreen" last="Ware">Doreen Ware</name>
<affiliation wicri:level="2">
<nlm:aff id="I2">Cold Spring Harbor Lab, 1 Bungtown Rd, Williams #5, Cold Spring Harbor, NY 11724, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Cold Spring Harbor Lab, 1 Bungtown Rd, Williams #5, Cold Spring Harbor, NY 11724</wicri:regionArea>
<placeName>
<region type="state">État de New York</region>
</placeName>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Genomics</title>
<idno type="eISSN">1471-2164</idno>
<imprint>
<date when="2008">2008</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Computational Biology (methods)</term>
<term>DNA Transposable Elements</term>
<term>Genome, Plant</term>
<term>Genomics (methods)</term>
<term>Methods</term>
<term>Oryza</term>
<term>Software</term>
<term>Sorghum</term>
<term>Zea mays</term>
</keywords>
<keywords scheme="KwdFr" xml:lang="fr">
<term>Biologie informatique ()</term>
<term>Génome végétal</term>
<term>Génomique ()</term>
<term>Logiciel</term>
<term>Méthodes</term>
<term>Oryza</term>
<term>Sorghum</term>
<term>Zea mays</term>
<term>Éléments transposables d'ADN</term>
</keywords>
<keywords scheme="MESH" type="chemical" xml:lang="en">
<term>DNA Transposable Elements</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en">
<term>Computational Biology</term>
<term>Genomics</term>
</keywords>
<keywords scheme="MESH" xml:lang="en">
<term>Genome, Plant</term>
<term>Methods</term>
<term>Oryza</term>
<term>Software</term>
<term>Sorghum</term>
<term>Zea mays</term>
</keywords>
<keywords scheme="MESH" xml:lang="fr">
<term>Biologie informatique</term>
<term>Génome végétal</term>
<term>Génomique</term>
<term>Logiciel</term>
<term>Méthodes</term>
<term>Oryza</term>
<term>Sorghum</term>
<term>Zea mays</term>
<term>Éléments transposables d'ADN</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>The challenges of accurate gene prediction and enumeration are further aggravated in large genomes that contain highly repetitive transposable elements (TEs). Yet TEs play a substantial role in genome evolution and are themselves an important subject of study. Repeat annotation, based on counting occurrences of
<italic>k</italic>
-mers, has been previously used to distinguish TEs from low-copy genic regions; but currently available software solutions are impractical due to high memory requirements or specialization for specific user-tasks.</p>
</sec>
<sec>
<title>Results</title>
<p>Here we introduce the Tallymer software, a flexible and memory-efficient collection of programs for
<italic>k</italic>
-mer counting and indexing of large sequence sets. Unlike previous methods, Tallymer is based on enhanced suffix arrays. This gives a much larger flexibility concerning the choice of the
<italic>k</italic>
-mer size. Tallymer can process large data sizes of several billion bases. We used it in a variety of applications to study the genomes of maize and other plant species. In particular, Tallymer was used to index a set of whole genome shotgun sequences from maize (B73) (total size 10
<sup>9 </sup>
bp.). We analyzed
<italic>k</italic>
-mer frequencies for a wide range of
<italic>k</italic>
. At this low genome coverage (≈ 0.45×) highly repetitive 20-mers constituted 44% of the genome but represented only 1% of all possible
<italic>k</italic>
-mers. Similar low-complexity was seen in the repeat fractions of sorghum and rice. When applying our method to other maize data sets, High-
<italic>C</italic>
<sub>0</sub>
<italic>t </italic>
derived sequences showed the greatest enrichment for low-copy sequences. Among annotated TEs, the most highly repetitive were of the Ty3/gypsy class of retrotransposons, followed by the Ty1/copia class, and DNA transposons. Among expressed sequence tags (EST), a notable fraction contained high-copy
<italic>k</italic>
-mers, suggesting that transposons are still active in maize. Retrotransposons in Mo17 and McC cultivars were readily detected using the B73 20-mer frequency index, indicating their conservation despite extensive rearrangement across cultivars. Among one hundred annotated bacterial artificial chromosomes (BACs),
<italic>k</italic>
-mer frequency could be used to detect transposon-encoded genes with 92% sensitivity, compared to 96% using alignment-based repeat masking, while both methods showed 92% specificity.</p>
</sec>
<sec>
<title>Conclusion</title>
<p>The Tallymer software was effective in a variety of applications to aid genome annotation in maize, despite limitations imposed by the relatively low coverage of sequence available. For more information on the software, see
<ext-link ext-link-type="uri" xlink:href="http://www.zbh.uni-hamburg.de/Tallymer"></ext-link>
.</p>
</sec>
</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Allemagne</li>
<li>États-Unis</li>
</country>
<region>
<li>Hambourg</li>
<li>État de New York</li>
</region>
<settlement>
<li>Hambourg</li>
</settlement>
</list>
<tree>
<country name="Allemagne">
<region name="Hambourg">
<name sortKey="Kurtz, Stefan" sort="Kurtz, Stefan" uniqKey="Kurtz S" first="Stefan" last="Kurtz">Stefan Kurtz</name>
</region>
</country>
<country name="États-Unis">
<region name="État de New York">
<name sortKey="Narechania, Apurva" sort="Narechania, Apurva" uniqKey="Narechania A" first="Apurva" last="Narechania">Apurva Narechania</name>
</region>
<name sortKey="Narechania, Apurva" sort="Narechania, Apurva" uniqKey="Narechania A" first="Apurva" last="Narechania">Apurva Narechania</name>
<name sortKey="Stein, Joshua C" sort="Stein, Joshua C" uniqKey="Stein J" first="Joshua C" last="Stein">Joshua C. Stein</name>
<name sortKey="Ware, Doreen" sort="Ware, Doreen" uniqKey="Ware D" first="Doreen" last="Ware">Doreen Ware</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002A10 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 002A10 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     PMC:2613927
   |texte=   A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Main/Exploration/RBID.i   -Sk "pubmed:18976482" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021